Abstract

Background: Prioritization methods have become a useful tool for mining large amounts of data to suggest
promising hypotheses in early research stages. In particular, network-based prioritization tools use a network
representation of the interactions between different biological entities to identify novel indirect relationships.
However, current network-based prioritization tools are strongly tailored to specific domains of interest (e.g.
gene-disease prioritization) and cannot handle networks with more than two types of entities (e.g. genes
and diseases). Therefore, the direct application of these methods to new prioritization tasks is limited.

Results: This work presents ProphNet, a generic network-based prioritization tool that can integrate an arbitrary
number of interrelated biological entities to accomplish any prioritization task. We tested the performance of ProphNet
in comparison with leading network-based prioritization methods, namely rcNet and DomainRBF, for gene-disease and
domain-disease prioritization, respectively. The results obtained by ProphNet show a significant improvement in terms of
sensitivity and specificity for both tasks. We also applied ProphNet to disease-gene prioritization on Alzheimer's disease,
Diabetes Mellitus Type 2 and Breast Cancer to validate the results and identify putative candidate genes involved in these diseases.

Conclusions: ProphNet works on top of any heterogeneous network by integrating information from different types
of biological entities to rank entities of a specific type according to their degree of relationship with a query set of
entities of another type. Our method works by propagating information across data networks and measuring
the correlation between the propagated values for the query and target sets of entities. ProphNet is available at:
http://genome2.ugr.es/prophnet. A Matlab implementation of the algorithm is also available at the website.



Background 

Advances in high-throughput technologies such
as DNA sequencing, linkage analysis, association studies
and expression arrays have fostered research towards
effective personalized medicine. To this end, pieces of
evidence of different nature derived from diverse data
sources must be integrated, together with algorithms
able to mine these data and identify novel biological
facts of relevance. Networks have been shown to be a useful
representation for combining heterogeneous biological
data. Currently, many large molecular networks are
available, such as protein-protein interaction (PPI)
networks, which model interactions between proteins.
Many methods have been proposed in the literature to 
represent and mine knowledge from biological networks 
[1]. For example, [2] proposes to apply text-mining in 
OMIM to generate a similarity network for human diseases 
and [3] builds a gene network based on the results of 
microarray experiments. These approaches have led to the 
emergence of new methods that exploit and integrate different
data sources into networks in a variety of ways [4].
Inferring new knowledge from existing networks is usually
based on "guilt-by-association" [5]. This extensively validated
principle states that biological entities which are
associated or interacting in a network are more likely to
share a common function. This principle makes it possible to
infer new relationships from already known interactions.
Prioritization methods are required in this context of massive
amounts of highly interconnected data.
Prioritization tools are based on computational approaches 
that use information retrieved from diverse sources in order 
to obtain ranked lists of candidate biological elements 
(genes, proteins, diseases, etc.) related to a certain target
element. Gene-disease prioritization, in which genes are 
ranked according to their relevance to a disease of interest 
(or vice versa), is the most popular prioritization task, and 
many methods have been proposed in the recent literature 
to accomplish this task [6]. Most of these methods focus on 
the analysis of phenotype and PPI networks for gene-disease 
prioritization. These methods weight the arcs connecting 
two proteins or phenotypes according to a measure of the 
similarity between them. CIPHER [7] computes correlation 
coefficients based on linear regressions of phenotype and 
PPI profiles. PRINCE [8] computes the relevance of a gene 
by using network propagation methods. RWRH [9] scores 
genes and diseases using a random walk approach on PPI 
and phenotype networks. rcNet [10] prioritizes candidate
genes by propagating node values and
measuring the degree of correlation
between two sets of nodes, one in the PPI/gene network 
and one in the phenotype network. Network-based gene- 
disease prioritization methods have been proven to provide 
better results than previous approaches [11-15]. 

Apart from gene-disease prioritization, other methods 
have been proposed to perform a prioritization of other 
biological entities. DomainRBF [16] performs a prioritiza- 
tion of protein domains for diseases using Bayesian linear 
regression. This method assumes a key role for protein 
domains in diseases as shown by previous studies [17]. 
Domains are basic structural and functional units of pro- 
teins, which in turn are composed of multiple structural 
domains, each one closely linked to a specific function. 
Although DomainRBF exploits the functional role of
protein domains in phenotypes, it does not explore
the simultaneous integration of PPI, domain and phenotype
networks for gene or disease prioritization. Despite
the good performance of these prioritization methods,
they have clear limitations. First, existing network-based
prioritization methods cannot consider more than two
types of networks when performing the prioritization
(e.g. gene and disease networks in rcNet, and domain and
disease networks in DomainRBF). Only non-network-based
methods have succeeded in integrating more than two
different types of entities for prioritization. For example,
Endeavour [13] performs an independent prioritization of
different entities using multiple heterogeneous generic data
sources, which are integrated into a single global ranking
using order statistics. However, the network-based methods
mentioned above have been shown to outperform this
method while using fewer data sources [7].



Second, existing prioritization methods are strongly tailored
to a specific domain of interest (for example, gene-disease
prioritization for rcNet and protein domain-disease
prioritization for DomainRBF). In our opinion,
these two drawbacks dramatically limit the applicability of
these methods to other prioritization tasks and prevent
the results from being improved by integrating information
about other types of related entities.

In this work we present ProphNet, a generic prioriti- 
zation method that outperforms previous methods by 
integrating and propagating information in an arbitrary 
number of heterogeneous data networks. Our method is
generic since it can prioritize biological entities of
any type with respect to biological entities of another
type. Therefore, the user can customize the goal of the 
prioritization task (disease-gene, domain-disease, drug- 
disease, etc.). Furthermore, the user is not restricted to 
the use of only two entities, and can integrate as many 
biological networks as desired. 

To compare the results obtained by ProphNet with those
obtained by state-of-the-art methods such as rcNet and
DomainRBF, we applied ProphNet to gene-disease and
domain-disease prioritization, respectively, on a
network built by integrating protein domain, PPI
and phenotype networks. ProphNet measures the influence
of a query set of biological entities of a certain type (e.g.
genes or diseases) on a target set of entities of another type
(e.g. diseases or genes, respectively). To this end, the
algorithm uses a graph representation as shown in Figure 1. In
this representation, each node corresponds to a biological
entity of a domain of interest (gene/protein, disease,
protein domain, etc.), and the arcs between two nodes are
labelled with a weight representing the strength of the
relationship between the connected entities. These weights are
derived from different databases and other biological
sources, and their interpretation varies depending on the
type of the connected entities and the final goal of the
study (e.g. physical/structural similarity, regulatory
dependence, similar functional roles, etc.). In our algorithm, the
arc weights for each network are compiled in an adjacency
matrix. The nodes of the graph are also labelled with a
value (in [0, 1]) representing the degree of association to
the query or the target set. There are two kinds of
networks: a) networks representing interactions or similarities
between entities of the same type, and b) networks
representing interactions or similarities between entities of
different types. Type b) networks are used to interconnect type
a) networks.

Figure 1 Problem Overview. Our problem is to determine how
related the query set and the target set are based on known
relations between elements. (Labels recovered from the figure:
Network A, Network C.)

The method we propose propagates node values
along paths across different data networks (representing
different biological entities) in order to derive new
information from the existing knowledge. This value
propagation is performed in two directions. First, values are
propagated within and between networks, through all 
the possible paths connecting the query set network to the 
target set network (not reaching the target set network). 
Second, values are also propagated within the target set 
network, starting from the target nodes. Both propagation 
processes follow the principle that the higher the weight of 
the arc between two entities is, the more similar the value 
of these two nodes should be. Therefore, these two label 
propagation processes derive a final graph in which the 
value assigned to a node represents its degree of relation 
with the query or target set, respectively. Finally, we mea- 
sure the degree of relationship between the query and tar- 
get sets by computing the correlation between the values 
assigned to the nodes in the target network and those 
assigned to their neighbour nodes in other networks, as 
proposed in previous works with good results for different 
prioritization tasks [7,10]. This process is explained in 
detail in the following section. 

This article is organized as follows. The method and the 
data sources are described in detail in section Methods. To 
validate the proposed methodology we integrate protein 
domain, PPI and phenotype networks and compare the 
results to those obtained by rcNet for gene prioritization 
and DomainRBF for domain prioritization. These results 
are presented in the Results section and show a significant 
improvement in terms of sensitivity and specificity. Proph- 
Net is also applied to several case studies (namely Alzhei- 
mer, Diabetes Mellitus Type 2 and Breast Cancer) to 
identify putative candidate genes involved in these dis- 
eases. The results of these tests can be found in the section 
Case Studies. Finally, some conclusions and future work 
are presented. 

Methods 

Let $D$ be a set of graphs (also referred to as networks)
defined as $D_i = (V_i, E_i)$ for $i = 1, \dots, n$, where $V_i$ is a set of
vertices which represent biological entities from a specific
domain satisfying $V_i \cap V_j = \emptyset$, $\forall i, j$ such that $i \neq j$. Each
node $v_{ik}$ (with $k = 1, \dots, |V_i|$) in $D_i$ is labelled with a
value $\Psi(v_{ik})$, initially set to zero, that indicates the degree
of relationship to the query or target set, depending on
the network $v_{ik}$ belongs to. $E_i$ is a set of weighted
undirected arcs representing relationships, similarities or
interactions between elements of $V_i$. The values of the nodes
change while the weights of the arcs remain constant
during the entire process. Let $R$ be a set of graphs defined
as $R_{ij} = (V_i \cup V_j, Q_{ij})$, where $Q_{ij}$ is a set of weighted
undirected arcs representing relationships, similarities or
interactions between elements of $V_i$ and $V_j$, with $i, j \in \{1, \dots, n\}$
and $i \neq j$. Therefore, $R_{ij}$ describes the relationships
between the biological entities from two different
networks: $D_i$ and $D_j$.

We define the heterogeneous global graph $G$ as $G =
(D, R)$. Let the graph $D_q \in D$ be the query network and
let $D_t \in D$ be the target network. Given the global
graph $G$, our goal is to find the degree of association
between a set of nodes $Q \subseteq V_q$ called the query set and
a set of nodes $T \subseteq V_t$ called the target set.

The initial values for the nodes in the query set are set
to 1 ($\Psi(v_{qi}) = 1$ for all nodes $v_{qi} \in Q$), while the rest of
the nodes are set to 0 ($\Psi(v_{qj}) = 0$ for nodes $v_{qj} \in V_q \setminus Q$).
The target network is initialized in the same way, but
considering the nodes in $V_t$ and $T$. The rest of the nodes in
$G$ are initially set to 0.

As we explain below in more detail, our method per- 
forms a propagation within networks pumping informa- 
tion between nodes. This process is based on the Flow 
Propagation algorithm [18,19], which uses the normal- 
ized Laplacian matrix to propagate labels between nodes 
in a network. The normalization takes into account the 
degree of each node to limit the bias toward annotations 
from high-degree nodes. This normalization is also criti- 
cal for convergence. The Flow Propagation algorithm is 
similar to a Random Walk with Restart, basically differ- 
ing in the normalization process that guides the propa- 
gation [19]. 

Let $N$ be the non-normalized adjacency matrix of a
network in $G$. Since $G = (D, R)$ and graphs in $R$ are bipartite
(i.e. the adjacency matrices of graphs in $R$ are not
square), let us assume $N$ has $r$ rows and $c$ columns. A
normalization for $N$ can be computed as:

$$\mathrm{norm}(N) = D_r^{-1/2}\, N\, D_c^{-1/2},$$

where $D_r$ and $D_c$ are diagonal matrices where each
component is defined as:

$$D_r(i, i) = \sum_{j=1}^{c} N_{ij}$$

for $i = 1, \dots, r$, and

$$D_c(k, k) = \sum_{j=1}^{r} N_{jk}$$

for $k = 1, \dots, c$.

Martinez et al. BMC Bioinformatics 2014, 15(Suppl 1):S5
http://www.biomedcentral.com/1471-2105/15/S1/S5

We define $M = \{M_i \mid M_i = \mathrm{norm}(D_i)$ where $i = 1, \dots, |D|\}$
as the set of normalized square adjacency matrices of
graphs in $D$. Similarly, we define $S = \{S_i \mid S_i = \mathrm{norm}(R_i)$
where $i = 1, \dots, |R|\}$ as the set of normalized adjacency
matrices of bipartite graphs in $R$.
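As a minimal sketch of this normalization, assume each weight is divided by the square root of the product of its row sum and its column sum (the usual normalized-Laplacian scaling). The function name `normalize`, the use of plain Python lists and the handling of isolated nodes are illustrative choices, not part of the original implementation:

```python
import math

def normalize(N):
    """Scale every weight N[i][k] by 1 / sqrt(row_sum[i] * col_sum[k])."""
    row_sums = [sum(row) for row in N]
    col_sums = [sum(col) for col in zip(*N)]
    normalized = []
    for i, row in enumerate(N):
        new_row = []
        for k, weight in enumerate(row):
            degree_product = row_sums[i] * col_sums[k]
            # Assumption: entries touching an isolated node stay at zero.
            new_row.append(weight / math.sqrt(degree_product) if degree_product > 0 else 0.0)
        normalized.append(new_row)
    return normalized

# A square matrix (same-type network) and a 2 x 3 bipartite matrix.
print(normalize([[0, 2], [2, 0]]))
print(normalize([[1, 0, 1], [0, 4, 0]]))
```

The same function covers both cases because it only relies on row and column sums, so it does not require the matrix to be square.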

Let $p_i = \{p_{i1}, p_{i2}, \dots, p_{il}\}$ be a path composed of
networks from $D$ connecting $D_q$ and $D_t$, satisfying $p_{ij} \in D$,
$p_{i1} = D_q$, $p_{il} = D_t$ and $p_{ij} \neq p_{ik}\ \forall j \neq k$. To compute the
degree of association between the query and target sets,
we first propagate values from the query set within the
query network, and from the target set within the target
network, as described in Section Value propagation
inside networks. Next, we identify all the possible paths
$P = \{p_1, \dots, p_{|P|}\}$ connecting the query network to the
target network. Figure 2 shows an example of a global
graph $G$ composed of five different networks or
domains, with three different paths connecting the
query network to the target network. Since the number
of networks is usually small, the computation of all
the paths connecting $D_q$ and $D_t$ can be accomplished by
a brute force algorithm. For each computed path $p_i$, we
propagate information from $p_{ij}$ to the following network
$p_{i(j+1)}$ in the path, as described in Section Value
propagation between networks. Then we propagate
information within the network $p_{i(j+1)}$, where $j = 1, 2, \dots, l - 2$.
The propagation continues until it has been performed
within the network $p_{i(l-1)}$ in the path.
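The brute force enumeration of paths can be sketched as a depth-first search over the graph whose nodes are the networks themselves. This is only an illustration: the function name `enumerate_paths`, the dictionary `links` and the network names are hypothetical and not taken from the original implementation.

```python
def enumerate_paths(links, query, target):
    """Return every path of distinct networks that goes from query to target."""
    paths = []

    def extend(path):
        current = path[-1]
        if current == target:
            paths.append(list(path))
            return
        for nxt in sorted(links.get(current, ())):
            if nxt not in path:  # a network may appear only once per path
                path.append(nxt)
                extend(path)
                path.pop()

    extend([query])
    return paths

# Hypothetical global graph: three networks, each pair joined by a bipartite graph.
links = {
    "genes": {"diseases", "domains"},
    "domains": {"genes", "diseases"},
    "diseases": {"genes", "domains"},
}
print(enumerate_paths(links, "genes", "diseases"))
```

Because a path stops as soon as it reaches the target network, the target never appears as an intermediate network.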

Finally, after performing this propagation through each
path in $P$, we correlate the values of the nodes in $D_t$
against the values of the nodes in $p_{i(l-1)}$ directly connected
to those in $D_t$, for all the paths $p_i \in P$. This step is
described in Section Value correlation between networks.
The obtained correlation value determines the degree of 
relationship between the query set and the target set. 

Although measuring the degree of relationship
between the query and target sets by correlating node
values seems less intuitive than continuing the
propagation of node values from the neighbouring networks to
the target nodes, the former approach has been shown
to perform better than the latter (see Additional file 1).
Therefore, it was selected as the measure of similarity in
our method. This approach was proposed in previous
network-based prioritization methods with good results
for different prioritization tasks [7,10]. In order to
accomplish prioritization tasks, in which only the query
set $Q \subseteq V_q$ and the target network $V_t$ are provided by
the user, we embed this pipeline into an iterative
process to score each node in the target network according
to its relationship with $Q$. This process is described in
Section Prioritization process.

Value propagation inside networks 

Several propagation methods have been proposed to
compute the similarity or distance between nodes within
a graph [4]. Methods based on local neighbourhood or
shortest paths fail to capture global interactions, in
contrast to global methods that take into account the
entire network topology [20].

Figure 2 Example of path computation. Three paths have been
obtained connecting the query network to the target network.

ProphNet implements a
flow propagation approach [18,19] that uses a network's
global information to propagate the
values assigned to the nodes within this network. To
carry out this propagation process within a network $D_k$,
we first define the prior information set $Z$ as those
vertices $v_{kj}$ with $\Psi(v_{kj}) \neq 0$. Therefore, the prior information
set matches $Q$ when propagating values within the query
network, and it matches $T$ when
propagating values within the target network. The value
$\Psi(v_{kj})$ of each node $v_{kj}$ in $Z$ ($j \in [1, |V_k|]$) is normalized as:



Let $x_0$ be a vector compiling the values initially
assigned to each node in $D_k$, and $x$ a vector with the
values assigned to each node in $D_k$ after performing
the propagation within $D_k$. To calculate $x$ we need to
solve the following optimization problem:

$$\min_{x} \; \sum_{i,j} M_{k,ij}\,(x_i - x_j)^2 + \frac{1 - \alpha}{\alpha} \sum_{i} (x_i - x_{0i})^2$$



where $M_k$ is the network's normalized adjacency
matrix. The closed-form solution of this problem is:

$$x = (1 - \alpha)(I - \alpha M_k)^{-1} x_0.$$

This linear system can be solved exactly. However,
there exists an iterative algorithm for solving this system
which is faster for large networks [21]:

$$x_{i+1} = \alpha M_k x_i + (1 - \alpha) x_0$$

with $i$ starting from 0. This algorithm implements an
iterative process where each node propagates its node
value to its neighbours, based on the weight of the arc
connecting them. The parameter $\alpha \in [0, 1]$ determines
the importance of the prior information set.

In order to further speed up this iterative process, we
define the following stopping criterion: $|x_i - x_{i+1}| < \kappa$.
This makes it possible to stop the iterative process when it has
almost converged, without the need for full convergence.
Experimental tests (results not shown) indicate that the
best performance is obtained for $\kappa < 10^{-3}$.
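The iterative update and its stopping criterion can be sketched as follows. This is a minimal illustration, assuming the matrix is already normalized and that the difference between consecutive vectors is measured as the largest absolute change in any component (the text does not specify the norm). The name `propagate_within` and the `max_iter` safeguard are hypothetical additions.

```python
def propagate_within(M, x0, alpha=0.9, tol=1e-5, max_iter=10_000):
    """Iterate x <- alpha * M x + (1 - alpha) * x0 until the change is below tol."""
    x = list(x0)
    for _ in range(max_iter):
        nxt = [
            alpha * sum(m * v for m, v in zip(row, x)) + (1 - alpha) * prior
            for row, prior in zip(M, x0)
        ]
        change = max(abs(a - b) for a, b in zip(nxt, x))
        x = nxt
        if change < tol:
            break
    return x

# Two connected nodes; only the first one belongs to the prior information set.
M = [[0.0, 1.0], [1.0, 0.0]]
print(propagate_within(M, [1.0, 0.0], alpha=0.5, tol=1e-9))
```

For this two-node example the closed-form solution with alpha = 0.5 is (2/3, 1/3), which the iteration approaches as the tolerance is tightened.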

For convenience, we refer to the vector obtained
after convergence, where each component represents
the value assigned to each node in the network $D_k$ after
performing the propagation within $D_k$ as part of a
propagation process through the path $p_j$, as $x_{kj}$. Since the
propagation values within the query and target networks are not
affected by the propagation processes through the paths
in $P$, we define $x_q$ as the vector obtained after propagating
node values within the query network, and $x_t$ as the
vector obtained after propagating node values within
the target network.

Value propagation between networks 

Given a network $D_i$ whose vertices are already assigned
a value according to $x_{il}$, we further propagate these
values to the next network $D_j$ in the current path $p_l$,
with $D_j \neq D_i$. Since $D_i$ and $D_j$ are connected by $R_{ij}$, the
information is propagated from the nodes of $D_i$ to the
nodes of $D_j$ across the edges of $R_{ij}$ by assigning each
node $v_{jk}$ from $D_j$ a value computed as the mean of the
values of the nodes from $D_i$ that $v_{jk}$ is connected with. This
expression is formalized as:

$$\Psi(v_{jk}) = \frac{\sum_{v \in neig_i(v_{jk})} \Psi(v)}{|neig_i(v_{jk})|}$$

where $neig_i(v_{jk})$ is the set of nodes from $D_i$ that are
directly connected to $v_{jk}$ according to $R_{ij}$. A thresholding
step is applied to this propagation process, since detailed
experimentation suggested that nodes with very low values
add noise to the process and reduce the performance (see
Additional file 2). To this end, a parameter $\gamma \in (0, 1]$ is
included in the process so that the $\lceil |V_j| (1 - \gamma) \rceil$ lowest
node values from $D_j$ after the propagation are updated to
$\Psi(v_{jk}) = 0$. The rest of the node values are not changed.
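A minimal sketch of this step, assuming the bipartite graph is given as a matrix `R` with one row per source node and one column per destination node, and that any positive weight counts as a connection. The function name `propagate_between` is hypothetical, and the way ties among equal values are broken (lowest index first) is an arbitrary choice of this sketch.

```python
import math

def propagate_between(R, x_src, gamma):
    """Average the values of linked source nodes, then zero the lowest values."""
    n_dst = len(R[0])
    values = []
    for d in range(n_dst):
        neighbours = [x_src[s] for s in range(len(R)) if R[s][d] > 0]
        # Assumption: a destination node without neighbours receives 0.
        values.append(sum(neighbours) / len(neighbours) if neighbours else 0.0)
    # Thresholding: the ceil(|V_j| * (1 - gamma)) lowest values are set to 0.
    n_zero = math.ceil(n_dst * (1 - gamma))
    lowest = sorted(range(n_dst), key=lambda d: values[d])[:n_zero]
    for d in lowest:
        values[d] = 0.0
    return values

R = [[1, 0, 1, 0],
     [1, 1, 0, 0]]
print(propagate_between(R, [0.8, 0.2], gamma=0.5))
```

With `gamma = 1` no value is removed, while values of `gamma` close to 0 keep only the few highest-scoring nodes.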

Value correlation between networks 

After the propagation process through one path finishes, 
the nodes in the networks which are adjacent to the tar- 
get network present values that determine their degree 
of relationship to the query set. Also, the nodes in the 
target network are assigned a value that determines the 
degree of relationship with the target set. We can indir- 
ectly measure the relationship between the query set 
and the target set by measuring the similarity between 
the values of the nodes in the target network and those 
that are directly connected to them in adjacent net- 
works. This can be calculated by simultaneously corre- 
lating these node values as derived by the propagation 
processes through all the different paths. For each path
$p_i$ with length $l$, a vector $x_i$ is computed as:

$$x_i = S_{il}\, x_{(l-1)i}$$

where $S_{il}$ is the normalized adjacency matrix of $R_{(l-1)(l)}$
and $x_{(l-1)i}$ is the vector obtained after propagating
values inside the network $D_{l-1}$.

Since the values of the target network after the
propagation process are represented by $x_t$, we define the
vector $t$ as:

$$t = \mathrm{concat}(\underbrace{x_t, \dots, x_t}_{|P| \text{ times}})$$

and the vector $x$ as:

$$x = \mathrm{concat}(x_i) \quad \forall i \in [1, |P|]$$

where concat means concatenation. Both $x$ and $t$ are
the same size.

Finally, the correlation value which derives a measure
for the relationship between the query and target sets is
computed as:

$$s = \mathrm{corr}(x, t)$$

where corr is Pearson's correlation.
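The concatenation and correlation can be sketched in a few lines. Here `path_vectors` is assumed to hold one vector per path, each already mapped onto the target network nodes. The function names and the convention of returning 0 when a vector has no variance are assumptions of this sketch, not details given in the text.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equally sized lists of numbers."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # assumption: an undefined correlation is reported as 0
    return cov / math.sqrt(var_a * var_b)

def association_score(path_vectors, x_t):
    """Correlate the concatenated path vectors with |P| copies of x_t."""
    x = [value for vector in path_vectors for value in vector]
    t = list(x_t) * len(path_vectors)
    return pearson(x, t)

print(association_score([[0.2, 0.4, 0.6]], [1.0, 2.0, 3.0]))
print(association_score([[1.0, 0.0], [1.0, 0.0]], [0.0, 1.0]))
```

Repeating the target vector once per path keeps both vectors the same length, so every path contributes the same number of value pairs to the correlation.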

Prioritization process 

In order to obtain a prioritized list of targets for a
query set of nodes, we follow an iterative
approach. For each target network node $v_{te}$, we set it
as the target set $T$ and compute the correlation value $s$
as described in the previous section (we call this
correlation $s_e$ since it is computed for $T = \{v_{te}\}$). Once
this correlation value has been computed for each
target network node, these nodes are sorted in decreasing
order according to their $s_e$ value. Target nodes with
high values of $s_e$ are expected to be strongly related to
the query set. The entire algorithm is described in the
pseudocode of Algorithm 1.
Algorithm 1 ProphNet

prioritize(G: global graph, Q: query set, $D_q$: query network, $D_t$: target network)
    Propagate values within $D_q$
    P: Compute the list of paths from $D_q$ to $D_t$ in G
    for each path $p_i = \{p_{i1}, p_{i2}, \dots, p_{il}\}$ in P do
        for each network $p_{ij}$ in the path $p_i$ from $p_{i1}$ to $p_{i(l-2)}$ do
            Propagate values from $p_{ij}$ to $p_{i(j+1)}$
            Propagate values within $p_{i(j+1)}$
        end for
        Store the values of $p_{i(l-1)}$ after propagation through path $p_i$ as $x_{(l-1)i}$
    end for
    for each entity $e \in V_t$ in the target network $D_t$ do
        Set target set T = {e}
        Propagate values within $D_t$
        Compute correlation coefficient $s_e$ using the stored $x_{(l-1)i}$ for each path $p_i$
    end for
    L: Sort all entities $e \in V_t$ by their $s_e$ values in descending order
    return L
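To make the control flow of Algorithm 1 concrete, the following sketch runs it on the simplest possible case: a single direct path between a query network and a target network. It is only an illustration under several assumptions: the matrices are already normalized, the names `propagate`, `pearson` and `prioritize` are hypothetical, and the toy data are invented for the example.

```python
import math

def propagate(M, x0, alpha=0.9, tol=1e-5, max_iter=10_000):
    """Iterate x <- alpha * M x + (1 - alpha) * x0 until the change is below tol."""
    x = list(x0)
    for _ in range(max_iter):
        nxt = [alpha * sum(m * v for m, v in zip(row, x)) + (1 - alpha) * p
               for row, p in zip(M, x0)]
        change = max(abs(a - b) for a, b in zip(nxt, x))
        x = nxt
        if change < tol:
            break
    return x

def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b) if var_a > 0 and var_b > 0 else 0.0

def prioritize(M_q, M_t, S_tq, query):
    """Rank target nodes by correlation with the values propagated from the query set."""
    x_q = propagate(M_q, [1.0 if i in query else 0.0 for i in range(len(M_q))])
    # Direct path: map query-network values onto their target-network neighbours.
    x_1 = [sum(w * v for w, v in zip(row, x_q)) for row in S_tq]
    scores = {}
    for e in range(len(M_t)):
        x_t = propagate(M_t, [1.0 if i == e else 0.0 for i in range(len(M_t))])
        scores[e] = pearson(x_1, x_t)
    return sorted(scores, key=scores.get, reverse=True)

# Toy data: two unconnected query nodes; target nodes 0 and 1 are linked to each
# other, target node 2 is isolated. Target 0 is linked to query node 0 and
# target 2 to query node 1.
M_q = [[0.0, 0.0], [0.0, 0.0]]
M_t = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
S_tq = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
print(prioritize(M_q, M_t, S_tq, query={0}))
```

In this toy case target node 1 has no direct link to the query node, yet it ranks above target node 2 because it is a neighbour of target node 0, which illustrates the guilt-by-association principle.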

Prioritization example 

To facilitate the understanding of the algorithm,
Figure 3 shows a step-by-step representation of the
ProphNet propagation processes. This figure shows two 



examples of a prioritization task involving three net- 
works or domains, with the elements of each network 
represented by circles, squares and diamonds, respec- 
tively. For simplicity and clarity, node values are repre- 
sented using a grey color scale (from white representing 
value 0 to black representing value 1) and the weight of 
an arc is represented by its line width. In the two exam- 
ples, the prioritization task involves the same target set 
but different query sets. The query and target sets con- 
tain only one element in both cases, which is initially 
(step 1) set to 1 (black). Node values are propagated 
from the query nodes within the query network (step 2), 
and from the target nodes within the target network 
(step 3). There are two paths connecting the query
network and the target network in these examples
(circles-squares and circles-diamonds-squares, respectively).
Values from the query network are then iteratively pro- 
pagated to adjacent non-target networks. Since the 
query network is directly connected to the target net- 
work in one of the paths, this step (step 4) is only 
applied to the path which includes an intermediate net- 
work (diamonds). Then, the propagation is performed 
within this intermediate network (step 5). This propaga- 
tion continues until all the networks in all the paths 
connecting the query and target sets have been reached. 
Finally, we measure the strength of the connection 
between the query and the target sets as the correlation 
between the values assigned to the nodes in the target 
network and the values assigned to their neighbours 
from other networks (step 6). 

Figure 3a shows an example in which the values pro- 
pagated from the query and target sets are highly corre- 
lated, suggesting a strong relationship between them. 
On the other hand, Figure 3b shows an example with 
low correlation values, which suggests that query and 
target sets are not related. 

Algorithm complexity 

The time complexity of the algorithm shown in the 
pseudocode Algorithm 1 can be determined by aggregat- 
ing the time complexity of each task it is composed of. 
Let $n$ be the number of nodes in a network and $m$ the
number of networks in the global graph $G$. The task of
propagating values within a network is $O(n^3)$. The
propagation of values between networks is $O(n^2)$. The
computation of the correlation coefficient for one path is
$O(n^3)$. The number of paths is bounded by $m!$ and their
length by $m$. Therefore, the computational complexity
of ProphNet is bounded by $O(m! \times m \times n^3)$. Despite
this high complexity, typical execution times are a few 
seconds since the value of m is usually small in real 
applications. A summary of ProphNet execution times 
and memory usage for the experiments shown in this 
paper can be found in Additional file 3. 
Results

As two specific case studies, we have applied ProphNet to
prioritize candidate genes and protein domains associated
with diseases. ProphNet has been compared with rcNet for
gene-disease prioritization and with DomainRBF for
domain-disease prioritization, since these methods were
recently proposed and reported better results than
previous methods [10,16]. ProphNet was run on a global
graph composed of interconnected disease, gene and protein
domain networks, while rcNet and DomainRBF
were run on gene-disease and domain-disease
networks, respectively, as proposed by their authors. It is
important to note that the ProphNet base case execution
using only the gene and disease networks would obtain
the same results as rcNet. The data sources used are
described in detail in Section Data sources.

We tested the performance of the different methods on 
several leave-one-out (LOO) cross-validation experiments 
and for predicting new associations recently added to 
OMIM. To measure the performance of the different 
prioritization methods, we used Receiver Operating Char- 
acteristic (ROC) curves. ROC curves plot the true positive 
rate vs. the false positive rate at various threshold settings. 
The area under the ROC curve (AUC) was also computed. 
Finally, the average ranking position of the true target in 
the prioritized lists obtained by each method was also 



computed and normalized by dividing by the total number 
of elements in the list (5080 diseases in this case). We also 
computed p-values for comparing the average ranking 
values using two-tailed t-tests and the Bonferroni 
correction. 
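The evaluation measures can be illustrated with a short sketch. It assumes that each test yields the 1-based rank of the single true target in a list of `n_candidates` elements, and that a ROC curve is built by sweeping a top-k cutoff over all tests. The text does not detail how the curves were constructed, so this aggregation and the function names are assumptions of the sketch.

```python
def normalized_mean_rank(ranks, n_candidates):
    """Mean rank of the true target divided by the length of the ranked list."""
    return sum(r / n_candidates for r in ranks) / len(ranks)

def roc_points(ranks, n_candidates):
    """(false positive rate, true positive rate) for every top-k cutoff."""
    points = [(0.0, 0.0)]
    for k in range(1, n_candidates + 1):
        hits = sum(1 for r in ranks if r <= k)
        # Every element in the top k other than the true target is a false positive.
        false_pos = sum(k - (1 if r <= k else 0) for r in ranks)
        points.append((false_pos / (len(ranks) * (n_candidates - 1)),
                       hits / len(ranks)))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

ranks = [1, 4]  # true target ranked first in one test and last in another
print(normalized_mean_rank(ranks, 4))
print(auc(roc_points(ranks, 4)))
```

A method that always ranks the true target first obtains an AUC of 1 under this construction, and one that always ranks it last obtains 0.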

For the results reported for ProphNet, $\alpha$ was set to
0.9, the error threshold in the flow propagation was set
to $10^{-5}$ and $\gamma$ was set to 0.00375. For rcNet, we set the
parameters to the values providing the best results according
to the authors, $\alpha = 0.9$, $\beta = 0.9$ and k = 10 5 [10], and
used the enumeration-correlation based version.

Data sources 

The disease phenotype network has been extracted from 
OMIM [22] using text mining techniques as described 
in [2]. Also, to perform a fair comparison of the results 
to those reported by rcNet, we used a version of OMIM 
from May, 2007 [10]. The obtained disease network 
contains 5080 OMIM disease phenotypes. The arcs are 
weighted with a value in the range [0, 1]. This weight 
measures the similarity between two phenotypes as the 
inverse of the distance between the feature vectors 
obtained by counting the occurrences of each term from 
the anatomy and disease sections of the Medical Subject 
Headings Vocabulary (MeSH) in the description text for 
the corresponding entries in OMIM. The obtained
disease network contains a total of 19,729 weighted
interactions.

The gene network has been obtained from the Human 
Protein Reference Database (HPRD [23]). This protein- 
protein interaction network contains 32,331 unique inter- 
actions between 8,919 proteins. The network connecting 
genes and phenotypes has been extracted directly from 
OMIM (phenotype-gene relationship fields) obtaining 
1,393 relationships. 

The domain network has been derived from DOMINE 
[24] and InterDom [25] containing 48,778 unique relations 
between 5,490 domains. Relations between domains and 
genes were extracted from Pfam [26]. Relations between
domains and phenotypes have been extracted from Pfam
and annotations of nsSNPs in the UniProt database [27].

The three networks (genes, protein domains and
diseases) have been used simultaneously in the experiments
performed with ProphNet. rcNet was executed using only
the gene and disease networks, since this method cannot
integrate more than two networks. DomainRBF
was run on the domain and disease networks due to the
same limitation.

Gene-disease validation 

To check whether the prioritization methods rcNet and 
ProphNet were able to retrieve a known relationship 



between a gene and a disease, we performed a leave- 
one-out cross-validation using gene-phenotype relations 
from OMIM. For each gene-phenotype relation reported 
in OMIM, we ran the two algorithms on a network in
which the explicit arc connecting the gene and phenotype
of interest was removed. The gene of interest was
set as the query set and the methods were asked to rank
all the phenotypes according to their association with this query set.
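The leave-one-out procedure can be sketched as a small harness that is independent of the prioritization method. The function `leave_one_out_ranks`, the callback `rank_phenotypes` and the toy ranker below are hypothetical. The toy ranker simply orders phenotypes by how many known associations they have and stands in for a real method such as ProphNet or rcNet.

```python
def leave_one_out_ranks(associations, rank_phenotypes):
    """Hide each known (gene, phenotype) pair in turn and record the phenotype's rank."""
    ranks = []
    for held_out in associations:
        gene, phenotype = held_out
        # Remove only the explicit link under test; all other links stay available.
        training = [pair for pair in associations if pair != held_out]
        ranking = rank_phenotypes(gene, training)
        ranks.append(ranking.index(phenotype) + 1)
    return ranks

def toy_ranker(gene, training):
    """Placeholder method: most frequently associated phenotypes first."""
    phenotypes = ["p1", "p2"]
    counts = {p: sum(1 for _, q in training if q == p) for p in phenotypes}
    return sorted(phenotypes, key=lambda p: (-counts[p], p))

known = [("g1", "p1"), ("g2", "p1"), ("g2", "p2")]
print(leave_one_out_ranks(known, toy_ranker))
```

The list of ranks returned by the harness is the input needed to compute the average ranking position and the ROC curves used to compare methods.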

The obtained ROC curves are shown in Figure 4. AUC
values and average rank values for the target disease are
displayed in Table 1. ProphNet outperforms rcNet in this
test, ranking the target phenotype in a significantly higher
position (corrected p-value < 0.05), with lower standard
deviation and better AUC values. The large difference in
AUC (over 10%) also suggests that the improvement does not
come from ProphNet ranking slightly better the targets that
rcNet prioritizes poorly, but from our method placing at the
top of the ranking targets that rcNet prioritizes poorly. It
is also important to note that, although a high percentage
of the cases were prioritized at the top of the ranking,
some targets were ranked very poorly by both methods, which
significantly increases the mean ranking and moves it far
from the top 1 position. This also applies to the
experiments described in the following two sections.




Figure 4 ROC curves ProphNet vs. rcNet. ROC curves for gene-disease prioritizations with ProphNet and rcNet. Curves shown: [Gene-Disease LOO] ProphNet, [Gene-Disease LOO] rcNet, [Gene-Disease new associations] ProphNet and [Gene-Disease new associations] rcNet. Horizontal axis: 1 - Specificity.

Table 1 Test results

| Test | Method | AUC | Normalized mean ranking (Std. Dev.) |
|---|---|---|---|
| Gene-disease LOO | ProphNet | 0.9393 | 0.0609 (0.1597) |
| Gene-disease LOO | rcNet | 0.80572 | 0.1944 (0.2448) |
| Gene-disease new associations | ProphNet | 0.80717 | 0.1930 (0.2618) |
| Gene-disease new associations | rcNet | 0.71636 | 0.2835 (0.2907) |
| Domain-disease LOO | ProphNet | 0.9319 | 0.0683 (0.1537) |
| Domain-disease LOO | domainRBF | 0.8678 | 0.1322 (0.2361) |

Performance comparison for leave-one-out cross-validation prioritization
experiments using OMIM.
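
As a reading aid for Table 1, the two reported metrics are closely linked when each query has a single true target: if the target is ranked r-th among n candidates, the AUC for that query is (n - r)/(n - 1), which is approximately 1 minus the normalized rank r/n for large n. The sketch below checks this relation against the values in Table 1. It is an observation about the reported numbers, not a description of how the authors computed them, and `single_positive_auc` is a hypothetical helper name.

```python
def single_positive_auc(rank, n_candidates):
    """AUC of one query whose single true target is at `rank` (1 = top) among
    `n_candidates` ranked entities: the fraction of the n_candidates - 1
    negatives that are ranked below the target."""
    return (n_candidates - rank) / (n_candidates - 1)

# (AUC, normalized mean ranking) pairs copied from Table 1.
table1 = {
    ("Gene-disease LOO", "ProphNet"): (0.9393, 0.0609),
    ("Gene-disease LOO", "rcNet"): (0.80572, 0.1944),
    ("Gene-disease new associations", "ProphNet"): (0.80717, 0.1930),
    ("Gene-disease new associations", "rcNet"): (0.71636, 0.2835),
    ("Domain-disease LOO", "ProphNet"): (0.9319, 0.0683),
    ("Domain-disease LOO", "domainRBF"): (0.8678, 0.1322),
}

for (test, method), (auc, norm_rank) in table1.items():
    # If AUC is close to 1 - normalized rank, each sum should be close to 1.
    print(f"{test:32s} {method:10s} AUC + rank = {auc + norm_rank:.4f}")
```

Under this reading, the AUC gap between two methods on the same test is essentially the gap between their normalized mean rankings.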



Gene-disease validation with new OMIM associations 

As a further validation, we predicted new associations
between phenotypes and genes in 387 case studies taken from
entries added to OMIM between May 2007 and May 2010; these
relationships are not reported in the datasets used in our
study. Each case study consists of a phenotype and a set of
genes (in most cases a single gene) associated with it.
Results of the comparison can be seen in Figure 4. AUC
values are shown in Table 1. The results show that our
algorithm clearly outperforms rcNet (corrected p-value <
0.05) in predicting new relationships not explicitly
represented in the data network.

Domain-disease validation 

To show that our algorithm can prioritize not only genes but
also other biological entities, we performed a leave-one-out
domain-disease validation test. For each relation between a
domain and a phenotype in our datasets, we ran the
prioritization methods on a global network from which the
direct arc connecting the protein domain and phenotype of
interest had been removed. The domain of interest was set as
the query set and the methods were asked to rank all the
phenotypes by their association with this query set.

Our method was compared with domainRBF for this task, since
domainRBF has recently been proposed for domain-disease
prioritization and builds the phenotype-domain network from
the same data sources considered in this study. We set the
parameters of domainRBF by testing for best performance. A
diffusion kernel was selected to compute distances in the
interaction matrices. B0 and V0 were set to 0 and 1,
respectively.

Results show that our method significantly improves on the
results provided by domainRBF for disease-domain
prioritization (corrected p-value < 0.05). The largest
difference in performance is around 10% in AUC terms, which
suggests that our method places more target phenotypes at
the top of the ranking. ROC curves for this comparison can
be seen in Figure 5.

Robustness analysis 

We carried out several experiments to test the robustness
of ProphNet to parameter variation. First, we checked
that varying the parameter α, which controls the importance
of prior information in label propagation, does not
significantly affect performance, as previous works
suggested for other methods [10,18]. Values between 0.5 and
0.9 gave similar performance for ProphNet, but the best
results were obtained with α set to 0.9.
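
To make the role of α concrete, the following is a minimal sketch of a commonly used label propagation update, f <- α S f + (1 - α) y, where S is the symmetrically normalized adjacency matrix and y holds the prior labels. That ProphNet uses exactly this update and this convention for α is an assumption here, and the function name `propagate` and the toy path graph are illustrative only.

```python
import numpy as np

def propagate(W, y, alpha=0.9, tol=1e-9, max_iter=1000):
    """Iterate f <- alpha * S f + (1 - alpha) * y until the update is below
    `tol`, with S = D^(-1/2) W D^(-1/2) (assumed normalization)."""
    W = np.asarray(W, dtype=float)
    y = np.asarray(y, dtype=float)
    deg = W.sum(axis=1)
    inv_sqrt = np.zeros_like(deg)
    inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5    # isolated nodes stay at 0
    S = W * inv_sqrt[:, None] * inv_sqrt[None, :]
    f = y.copy()
    for _ in range(max_iter):
        f_next = alpha * (S @ f) + (1 - alpha) * y
        if np.abs(f_next - f).max() < tol:
            return f_next
        f = f_next
    return f

# Toy network: a path of four nodes, with prior information on node 0 only.
W_toy = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
y_toy = np.array([1.0, 0.0, 0.0, 0.0])

# Sweep alpha over the range explored in the robustness analysis.
for alpha in (0.5, 0.7, 0.9):
    f = propagate(W_toy, y_toy, alpha=alpha)
    print(alpha, np.round(f, 4), "ranking:", np.argsort(-f))
```

In this convention a larger α lets values spread further from the query nodes, while (1 - α) weights the prior labels; a sweep like the one above is one simple way to see whether the resulting ranking changes with α.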

Second, we tested the impact of varying the parameter γ on
the results. γ was used to limit the propagation of noise in
the label propagation between different networks. The
experiments showed that, although ProphNet performed well
for any value of γ in [0, 1], the best results were obtained
for γ in [0.002, 0.004].

Results from these experiments can be found in Addi- 
tional file 2. 

Case studies 

In order to show the applicability of the proposed
method to real case studies, we used it for gene-disease
prioritization of multifactorial disorders such as
Alzheimer Disease, Diabetes Mellitus Type 2 and Breast
Cancer, using the data sources described in section Data
sources. Parameters were set to the values that gave the
best results in the validation experiments (α = 0.9,
γ = 0.00375 and k = 10^5). The genes ranked in the top
positions for each disease are shown in Table 2, together
with their assigned scores. A detailed discussion of the
role of these genes in the associated diseases can be found
in Additional file 4. A full list can be obtained by running
the tool at the ProphNet website.

Results for Alzheimer Disease

Our method was used to prioritize genes related to
Alzheimer Disease (MIM:104300). Table 2 shows genes ranked
in the top positions that were previously known (OMIM
records) to be connected with Alzheimer, such as APP and
PSEN2. Furthermore, new relationships not explicitly
reported in OMIM are suggested by the other genes in the top
10. For example, MAPT was ranked 3rd in the prioritized
list. This gene provides the instructions for making a
protein called tau, which is found throughout the nervous
system (including neurons of the brain), and it has been
associated with Alzheimer [28]. PSEN1, with known relations
to Alzheimer type 3 [29], was ranked 4th. TREM2 was ranked
5th, suggesting an important role in Alzheimer, as shown by
some population studies [30,31]. HD/HTT was ranked 6th;
although this gene has not yet been directly associated with
Alzheimer, it has been shown to affect Huntington's disease
[32]. More details about the other genes in the top 10 are
provided in Additional file 4.

Results for Diabetes Mellitus Type 2

Our method was used to prioritize genes related to
Diabetes Mellitus (DM) Type 2 (MIM:125853). Genes
previously known to be connected with the disease, according
to OMIM records, are IRS1, INSR, IPF1, SLC2A4, PPP1R3A and
TCF1, all ranked in the top 6 of the prioritized list of
genes. New putative candidate genes were found in the top
10. PLN (ranked 7th) was not related to diabetes in the
corresponding OMIM entry; however, [33] reports a role of
PLN in diabetic cardiomyopathy. HADHSC, ranked 8th, has been
related to hyperinsulinemic hypoglycemia [34,35]. The
inferred relationship between diabetes and LEPRE1 (ranked
9th) cannot be derived from the published literature, and
further studies are required to examine the possible
connections of this gene to DM. Other interesting genes were
ranked high, such as KCNJ11, ranked 30th, which presents
polymorphisms that confer susceptibility to diabetes
mellitus type 2 [36]; or ABCC8, ranked 37th, whose mutations
increase the risk of diabetes as suggested by [37].

Table 2 Ranking positions and assigned scores for top prioritized genes for each case study

Alzheimer Disease (MIM:104300)

| Gene | Rank | Score |
|---|---|---|
| APP* | 1 | 0.6639 |
| PSEN2* | 2 | 0.5462 |
| MAPT | 3 | 0.2531 |
| PSEN1 | 4 | 0.1946 |
| TREM2 | 5 | 0.1700 |
| HD/HTT | 6 | 0.1585 |
| CST3 | 7 | 0.1511 |
| ITM2B | 8 | 0.1468 |
| TYROBP | 9 | 0.1296 |
| SNCA | 10 | 0.1276 |
| APOE | 11 | 0.1141 |
| NCSTN | 12 | 0.1114 |

Diabetes Mellitus Type 2 (MIM:125853)

| Gene | Rank | Score |
|---|---|---|
| IRS1* | 1 | 0.4744 |
| PPP1R3A* | 2 | 0.4660 |
| SLC2A4* | 3 | 0.4194 |
| IPF1* | 4 | 0.3308 |
| INSR* | 5 | 0.2950 |
| TCF1* | 6 | 0.2168 |
| PLN | 7 | 0.1164 |
| HADHSC | 8 | 0.0976 |
| LEPRE1 | 9 | 0.0976 |
| LEPREL4 | 10 | 0.0976 |
| NEUROD1 | 14 | 0.0905 |
| KCNJ11 | 30 | 0.0595 |
| ABCC8 | 37 | 0.0456 |

Breast Cancer (MIM:114480)

| Gene | Rank | Score |
|---|---|---|
| BRCA1* | 1 | 0.5019 |
| RAD51* | 2 | 0.4919 |
| BRCA2* | 3 | 0.4813 |
| NBN/NBS1* | 4 | 0.3547 |
| PIK3CA* | 5 | 0.3199 |
| MSH2 | 6 | 0.1636 |
| RB1 | 7 | 0.1607 |
| TP53 | 8 | 0.1307 |
| ELAC2 | 9 | 0.1038 |
| RAD51AP1 | 10 | 0.1031 |
| RAD54L | 11 | 0.1031 |
| FANCD2 | 12 | 0.1017 |
| ATM | 13 | 0.0934 |
| CHEK2 | 29 | 0.0551 |

Gene symbol, rank position and assigned score for genes in the top of the ranking for each case study. Entries marked with asterisks were directly connected by an arc to the disease of interest in the data network.

Results for Breast Cancer

We performed a prioritization for Breast Cancer
(MIM:114480). Genes previously known to be related to this
disease according to OMIM are BRCA1, RAD51, BRCA2, NBN and
PIK3CA, all included in the top 5 returned by ProphNet for
this disease.

New relations not explicitly represented in the data network
were found at the top of the ranking. Defects in MSH2
(ranked 6th) can cause different types of cancer, as pointed
out by [38]. RB1 (ranked 7th) and TP53 (ranked 8th) act as
tumour suppressors [39]. ELAC2 (ranked 9th) has not been
associated with breast cancer but with prostate cancer [40].
RAD51AP1 (10th) is closely related to RAD51 (2nd) [41].
RAD54L (11th) plays an important role in repairing and
recombining DNA in mammalian cells [42]. FANCD2 (12th)
interacts with the BRCA1 and BRCA2 genes in the DNA repair
process to reduce the risk of breast cancer [43]. ATM (13th)
has been associated with the disease in various studies
[44]. Other relevant genes were found further down the list,
such as CHEK2 (ranked 29th), which has also been associated
with a predisposition to breast cancer, as shown by [45].

Conclusion 

In this paper we have introduced ProphNet, a novel
network-based method that can accomplish any prioritization
task from a network representing the corresponding data
interactions. Our method is flexible and can be run on a
graph composed of an arbitrary number of data networks
representing biological entities of different types. This
paper illustrates how to run ProphNet to perform
gene-disease and domain-disease prioritization tasks, and
provides experimental evidence that ProphNet outperforms
other prioritization algorithms specifically designed for
these tasks. A ProphNet web application has also been
developed as a result of this work (the user guide can be
found in Additional file 5).

The results obtained by ProphNet on real case studies
on Alzheimer, DM and Breast Cancer show the potential
of the method to suggest putative candidate genes that may
be involved in a disease. A detailed analysis of the
literature also allowed us to validate the results provided
by the algorithm, since many of the top-ranked genes were
already known to be related to the diseases. We consider
that prioritization methods are useful for assisting
scientists at early research stages and for formulating
novel hypotheses of interest.

The extensive experimentation also allowed us to
study the indirect influence of considering protein
domains for the prioritization of candidate genes and
diseases. Results show that adding domain interactions
produces a clear improvement with respect to existing
algorithms, revealing the importance of this source of
information (barely used before in this task). In the
future, one of our main goals is to see how our method
behaves in other prioritization problems and with different
entities and sources of data not covered in this work.
Furthermore, we plan to study in more detail the quality of
the datasets and their influence on performance, and to
apply new propagation methods to try to improve the results.